Benjamin King, PhD | bking@mdibl.org | Kyle Shank | kshank@mdibl.org |
Align a set of RNA-seq reads from the Mycobacterium phage Giles to the Mycobacterium smegmatis MC2 155 genome assembly and perform a basic analysis.
The overall process consists of five discrete steps:
Before proceeding, you must register for an account on Galaxy. It’s fast and, importantly, free. After logging in, you should arrive at a screen that looks like this:
In this step, we’re going to download the genome assembly (a FASTA file) and the annotation (a GTF file) onto our local machine from EnsemblBacteria.
Click on this link to download the M. smegmatis MC2 155 genome assembly.
Beneath the section labeled “Gene Annotation”, click on the FASTA link.
This is the proper place to click to fetch the FASTA file
When prompted, make sure to choose to continue your download as a Guest. A directory will then be downloaded. Within it are several sub-directories: cdna, cds, dna, ncrna, pep. Open dna. From here, save the file marked Mycobacterium_smegmatis_str_mc2_155.ASM1500v1.dna_sm.toplevel.fa.gz to your working directory on your local machine.
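Before uploading, it can be useful to confirm that the downloaded assembly is intact. The following is a minimal sketch, not part of the original workflow, that lists the length of each sequence in a FASTA file. The function names and the toy record are illustrative assumptions; `fasta_lengths_from_gz` would be pointed at the `.fa.gz` file saved above.

```python
import gzip
import io


def fasta_lengths(lines):
    """Return {sequence_id: length} for an iterable of FASTA text lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token of the header.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def fasta_lengths_from_gz(path):
    """Same as fasta_lengths, but reads a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return fasta_lengths(handle)


# Toy input (hypothetical record names, not the real assembly).
demo = ">chr_demo example record\nACGTAC\nGGT\n>plasmid_demo\nTTAA\n"
print(fasta_lengths(io.StringIO(demo)))
```

On the real file, a call such as `fasta_lengths_from_gz("Mycobacterium_smegmatis_str_mc2_155.ASM1500v1.dna_sm.toplevel.fa.gz")` should report one entry per sequence in the assembly.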
Genome annotation can be one of the more difficult problems tackled in bioinformatics, mostly due to the plethora of file formats and transformation tools that are available. To simplify the task in this particular study, we have provided a suitable GTF file, available here.
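To make the GTF format less opaque, here is a small sketch of how a single GTF line can be read. A GTF line has nine tab-separated columns, the last of which holds `key "value";` pairs. The parser and the example line are illustrative assumptions (the line is made up, not taken from the provided file), and the attribute splitting is deliberately naive: it would break on a semicolon inside a quoted value.

```python
def parse_gtf_line(line):
    """Parse one GTF feature line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields

    attrs = {}
    for item in attributes.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        # Each attribute looks like: gene_id "some_value"
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "strand": strand,
        "attributes": attrs,
    }


# Hypothetical example line.
demo_line = 'Chromosome\tdemo\texon\t100\t250\t.\t+\t.\tgene_id "gene_demo"; transcript_id "tx_demo";'
record = parse_gtf_line(demo_line)
print(record["feature"], record["start"], record["end"], record["attributes"]["gene_id"])
```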
From the Galaxy homepage, select Get Data.
Next, click Upload File from your computer.
Drag and drop both files that you wish to upload. Under Type, make sure to set the Genome Assembly file to fasta and the Genome Annotation file to gtf. Click Start.
Click close. You can now see in your history bar (on the right) that you’ve successfully uploaded both files.
The reads for the Giles RNA-Seq study were initially deposited in the NCBI Sequence Read Archive (SRA). The European Nucleotide Archive (ENA) has mirrored those data and has made it easy to upload the FASTQ files into Galaxy.
Navigate to the ENA and enter the Gene Expression Omnibus accession for this study, GSE43434. Click Search.
On the results page, select the first link (SRP017906) to look at the study.
On the results page, make note of the SRR- values in the Run accession column. Each of these corresponds to an individual run through the sequencing machine. Copy the first string you see (SRR647673).
Return to Galaxy. From the toolbar, select NCBI SRA Tools. Then select Extract reads in FASTQ/A format from NCBI SRA.
On this page, paste your copied run accession code (SRR647673) into the appropriate blank field. Then click Execute.
This will add the file (as a pending job) to your history on the right. Note that this process can take quite a bit of time to complete. Repeat the above for the other two SRR- files (SRR647674, SRR647675). Note that these files will be in the fastqsanger format.
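For reference, fastqsanger means standard four-line FASTQ records whose quality characters encode Phred scores with an ASCII offset of 33. The sketch below shows one way to read such records and decode the scores. The function name and the toy read are illustrative assumptions, and the reader assumes each record occupies exactly four lines.

```python
import io


def read_fastq(lines):
    """Yield (name, sequence, phred_scores) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue
        seq = next(it, "").strip()
        plus = next(it, "").strip()
        qual = next(it, "").strip()
        if (not header.startswith("@") or not plus.startswith("+")
                or len(seq) != len(qual)):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # Sanger encoding: Phred score = ASCII code - 33.
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores


# Toy record (hypothetical read, not from the Giles data set).
demo_fastq = "@read_demo\nACGT\n+\nII5!\n"
for name, seq, scores in read_fastq(io.StringIO(demo_fastq)):
    print(name, seq, scores)
```

In this encoding `I` corresponds to a score of 40, `5` to 20, and `!` to 0.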
Quality control is an important step in bioinformatics pipelines, as in any data analysis. We will be using FastQC, which aims to provide a simple way to run quality control checks on raw sequence data coming from high-throughput sequencing pipelines. It provides a modular set of analyses that give a quick impression of whether your data has any problems you should be aware of before doing any further analysis.
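To give a feel for the kind of check involved, the sketch below computes the mean Phred quality at each read position from a list of quality strings. This is a simplified illustration loosely modeled on a per-base quality summary; it is not FastQC's implementation, and the function name and toy input are assumptions.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position.

    Reads may differ in length; each position is averaged over the
    reads that are long enough to cover it.
    """
    totals = []
    counts = []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# Toy quality strings (Phred+33): 'I' is 40, '5' is 20.
print(mean_quality_per_position(["II5", "I5"]))
```

A steady drop in the mean toward the end of the reads is a common pattern and one reason to inspect quality before mapping.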
The main functions of FastQC are:
From the Galaxy toolbar, click NGS: QC and manipulation. Then click FastQC.
Select one of your FASTQ files from the history to read in. Note that these files may have different references in your own history (in the example, the 3 FASTQ files imported from ENA are called 9: Extract Reads, 10: Extract Reads, and 11: Extract Reads). Then click Execute.
You will see two new additions to your history bar: a FastQC “RawData” file, and a FastQC “Webpage”. You can download these files to your desktop and examine the FastQC output. The main file of interest is the HTML file. Upon opening, you should see something similar to this:
Mapping refers to the process of aligning short reads to a reference sequence, whether the reference is a complete genome, transcriptome, or de novo assembly. Numerous programs have been developed to map reads to a reference sequence; they differ in their algorithms and therefore in speed. The program that we use in this pipeline is called Bowtie. More information is available here.
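As a conceptual illustration only, the sketch below maps a read by searching for an exact match on either strand of a reference string. This is not how Bowtie works: Bowtie uses a compressed index of the reference and tolerates mismatches, which is what makes it practical for millions of reads. The function names and toy sequences are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def map_read(read, reference):
    """Return (1-based position, strand) of the first exact match, or None."""
    pos = reference.find(read)
    if pos != -1:
        return pos + 1, "+"
    # Try the opposite strand by searching for the reverse complement.
    pos = reference.find(reverse_complement(read))
    if pos != -1:
        return pos + 1, "-"
    return None


# Toy reference and reads (hypothetical sequences).
reference = "ACGTTGCAAGGCT"
for read in ["TGCAAG", "AGCCTT", "GGGGGG"]:
    print(read, map_read(read, reference))
```

The second read matches only after reverse complementing, so it is reported on the minus strand; the third read does not map at all.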
Click NGS: Mapping. Then click Map with Bowtie for Illumina.
In the first blank area (“Will you select a reference genome…?”), select Use one from the history. Then, choose your reference genome (Mycobacterium_smegmatis_str_mc2_155.ASM1500v1.dna_sm.toplevel.fa.gz). Leave the rest of the settings as they are, but make note of the FASTQ file that you are performing the mapping on, as you’ll need to repeat this step for each of the three FASTQ files that you’ve loaded into Galaxy. When you’re ready, click Execute.
Repeat this step for the remaining two FASTQ files. Note that your output files will be SAM files. For more information on SAM/BAM files, click here.
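A SAM file is tab-separated text: header lines start with `@`, and each alignment line carries a bitwise FLAG in its second column, where bit 0x4 marks an unmapped read. The sketch below tallies mapped and unmapped reads from SAM lines as a quick sanity check on a mapping run. The function name and the toy lines are illustrative assumptions, not output from this study.

```python
def summarize_sam(lines):
    """Count mapped and unmapped alignment records in SAM text lines."""
    counts = {"mapped": 0, "unmapped": 0}
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:  # bit 0x4: segment unmapped
            counts["unmapped"] += 1
        else:
            counts["mapped"] += 1
    return counts


# Toy SAM content (hypothetical reads and reference name).
demo_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read_1\t0\tchr_demo\t5\t255\t6M\t*\t0\t0\tTGCAAG\tIIIIII",
    "read_2\t16\tchr_demo\t8\t255\t6M\t*\t0\t0\tAAGGCT\tIIIIII",
    "read_3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGGG\tIIIIII",
]
print(summarize_sam(demo_sam))
```

Flag 16 (bit 0x10) in the second toy record indicates a read aligned to the reverse strand; it still counts as mapped.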
To perform differential analysis, it’s necessary to be able to calculate the number of reads mapping to each feature. Here, we think of a feature as an interval (i.e., a range of positions) on a chromosome or a union of such intervals. In the case of RNA-Seq, the features are typically genes, where each gene is considered here as the union of all its exons. One may also consider each exon as a feature, e.g., in order to check for alternative splicing.
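The counting idea described above can be sketched in a few lines. In this simplified illustration, each gene is a list of inclusive intervals, a read is counted for a gene if it overlaps exactly one gene, and reads overlapping none or several genes go into separate bins. It only approximates what a dedicated counting tool does, and the data structures and gene names are assumptions made for the example.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def count_reads(features, reads):
    """Count reads per feature.

    features: {gene_id: [(start, end), ...]} with inclusive coordinates
    reads:    [(start, end), ...] aligned read intervals
    """
    counts = {gene: 0 for gene in features}
    counts["__no_feature"] = 0
    counts["__ambiguous"] = 0
    for r_start, r_end in reads:
        hits = {
            gene
            for gene, intervals in features.items()
            if any(overlaps(r_start, r_end, s, e) for s, e in intervals)
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


# Toy annotation and reads (hypothetical genes on a single sequence).
features = {"gene_a": [(100, 200)], "gene_b": [(180, 300)]}
reads = [(110, 140), (190, 195), (250, 280), (400, 430)]
print(count_reads(features, reads))
```

The read at 190 to 195 falls where the two toy genes overlap, so it is set aside as ambiguous rather than being counted twice.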
To perform this task, we will use the htseq-count program.
From the toolbar, click NGS: RNA Analysis. Then click htseq-count.
In the first input (“Aligned SAM/BAM file”), select one of the SAM files generated by Bowtie. Then, make sure your GTF file is in the second input (“GFF File”). Leave the rest of the parameters as they are. Click Execute.